3 research outputs found

    A self-calibrating system for finger tracking using sound waves

    In this thesis, a system for tracking a user's fingers using sound waves is developed. The proposed solution is to attach a small speaker to each finger and to place a number of microphones ad hoc around a computer monitor, listening to the speakers. The system should then be able to track the positions of the fingers so that the coordinates can be mapped to the computer monitor and used for human-computer interfacing. The thesis focuses on a proof of concept of the system. The system pipeline consists of three parts: signal processing, system self-calibration, and real-time sound source tracking.

    In the signal processing step, four different signal methods are constructed and evaluated, and it is shown that multiple signals can be used in parallel. The best-performing signal method superimposes a number of damped sine waves, each with a different frequency within a specified frequency band. The goal was to use ultrasound frequency bands for the system, but experiments showed that these gave rise to severe aliasing, rendering the higher frequency bands unusable.

    The second step, system self-calibration, performs a scene reconstruction to find the positions of the microphones and the sound source path using only the received signal transmissions. First, the time-difference-of-arrival (TDOA) values are estimated using robust techniques centred on GCC-PHAT. The time offsets are then estimated in order to convert the TDOA problem into a time-of-arrival (TOA) problem, so that the positions of the receivers and sound events can be calculated. Finally, a "virtual screen" is fitted to the sound source path for use in coordinate projection. The scene reconstruction was successful in 80% of the test cases, in the sense that it produced spatial position estimates at all. For the successful cases, the microphone position estimates had average errors of 11.8 ± 5 cm, which is worse than the results presented in previous research, although the best test case outperformed the results of another paper. The newly developed and implemented technique for finding the virtual screen was far from robust and found a reasonable virtual screen in only 12.5% of the test cases.

    In the third step, the sound events were estimated one at a time using the SRP-PHAT method with the CFRC improvement. Unfortunate choices of search volumes made the calculations very computationally heavy. The results were comparable to those of the system self-calibration when using the same data and the estimated microphone positions.
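    As a rough illustration of the winning signal design, the Python/NumPy sketch below superimposes damped sine waves at evenly spaced frequencies. The frequency band, component count, sample rate, and decay constant are illustrative assumptions, not values from the thesis.

        import numpy as np

        def stacked_damped_sines(freqs_hz, fs=44100, duration_s=0.05, decay=60.0):
            # Sum one sine per frequency, then apply a shared exponential envelope.
            t = np.arange(int(fs * duration_s)) / fs
            signal = sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)
            signal *= np.exp(-decay * t)                 # damping
            return signal / np.max(np.abs(signal))       # normalise to [-1, 1]

        # Illustrative band: ten components between 6 and 8 kHz.
        pulse = stacked_damped_sines(np.linspace(6000.0, 8000.0, 10))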
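    The TDOA estimation in the second step is built around GCC-PHAT. A minimal, self-contained version of the standard GCC-PHAT delay estimator is sketched below; the robustifications used in the thesis are not reproduced here.

        import numpy as np

        def gcc_phat(sig, ref, fs, max_tau=None):
            # Cross-power spectrum with PHAT weighting: discard magnitude, keep phase.
            n = sig.size + ref.size                      # zero-pad against circular wrap
            R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
            cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
            max_shift = n // 2
            if max_tau is not None:
                max_shift = min(int(fs * max_tau), max_shift)
            cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
            return (np.argmax(np.abs(cc)) - max_shift) / float(fs)  # delay in seconds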
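    For the third step, a plain SRP-PHAT grid search might look like the following: each candidate point is scored by summing the pairwise GCC-PHAT correlations at the delays that point would produce. This sketches the basic method only; the CFRC improvement and the thesis's search volumes are not reproduced, and the grid is a placeholder.

        import numpy as np
        from itertools import combinations

        SPEED_OF_SOUND = 343.0  # m/s

        def srp_phat(frames, mic_pos, grid, fs):
            # frames: (num_mics, num_samples); mic_pos, grid: arrays of 3-D points.
            n = 2 * frames.shape[1]
            pairs = list(combinations(range(len(mic_pos)), 2))
            ccs = {}
            for i, j in pairs:
                R = np.fft.rfft(frames[i], n) * np.conj(np.fft.rfft(frames[j], n))
                ccs[i, j] = np.fft.irfft(R / (np.abs(R) + 1e-15), n)  # PHAT weighting
            scores = np.zeros(len(grid))
            for k, p in enumerate(grid):
                for i, j in pairs:
                    # Delay this point implies for the pair, in samples (circular index).
                    tau = (np.linalg.norm(p - mic_pos[i])
                           - np.linalg.norm(p - mic_pos[j])) / SPEED_OF_SOUND
                    scores[k] += ccs[i, j][int(round(tau * fs)) % n]
            return grid[np.argmax(scores)]               # highest steered response power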

    Visual Entity Linking: A Preliminary Study

    In this paper, we describe a system that jointly extracts entities appearing in images and mentioned in their accompanying captions. As input, the entity linking program takes a segmented image together with its caption. It consists of a sequence of processing steps: part-of-speech tagging, dependency parsing, and coreference resolution, which enable us to identify the entities as well as possible textual relations from the captions. The program uses the image regions labelled with a set of predefined categories and computes WordNet similarities between these labels and the entity names. Finally, the program links the entities it detected across the text and the images. We applied our system to the Segmented and Annotated IAPR TC-12 dataset, which we enriched with entity annotations, and we obtained a correct assignment rate of 55.48%.
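    The paper does not state which WordNet similarity measure is used; as a hedged illustration, the NLTK sketch below scores a region label against an entity name using path similarity over their noun senses.

        # Requires: pip install nltk, then nltk.download('wordnet') once.
        from nltk.corpus import wordnet as wn

        def label_similarity(region_label, entity_name):
            # Best path similarity over all noun-sense pairs of the two words.
            pairs = [(a, b)
                     for a in wn.synsets(region_label, pos=wn.NOUN)
                     for b in wn.synsets(entity_name, pos=wn.NOUN)]
            sims = [a.path_similarity(b) for a, b in pairs]
            return max((s for s in sims if s is not None), default=0.0)

        print(label_similarity('dog', 'animal'))  # higher for semantically close words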

    Image Segmentation and Labeling Using Free-Form Semantic Annotation

    In this paper, we investigate the problem of segmenting images using the information in text annotations. In contrast to the general image understanding problem, this type of annotation-guided segmentation is less ill-posed, in the sense that human annotators show higher consensus on the expected output. In the paper, we present a system based on combined visual and semantic pipelines. In the visual pipeline, a list of tentative figure-ground segmentations is first proposed, and each such segmentation is classified into a set of visual categories. In the natural language processing pipeline, the text is parsed and chunked into objects. Each chunk is then compared with the visual categories, and the relative distance is computed using the WordNet structure. The final choice of segments and their correspondence to the chunked objects is then obtained using combinatorial optimization. The output is compared to manually annotated ground-truth images. The results are promising, and there are several interesting avenues for continued research.
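    The paper names combinatorial optimization without specifying a solver. One standard choice for a one-to-one matching between segments and text chunks is the Hungarian algorithm, sketched below on a hypothetical cost matrix (e.g. one minus a WordNet similarity).

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # Hypothetical costs: rows = tentative segments, columns = text chunks.
        cost = np.array([[0.2, 0.9, 0.8],
                         [0.7, 0.1, 0.6],
                         [0.9, 0.8, 0.3]])

        rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
        for seg, chunk in zip(rows, cols):
            print(f"segment {seg} -> chunk {chunk} (cost {cost[seg, chunk]:.2f})")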